

ABSTRACT 


This dissertation explores the distributional properties
of commonly used statistics developed in the course of
empirical model building. A review of some of the more
noteworthy efforts to investigate the distribution of the
coefficient of determination, $R^2$, in best subset regression is
given. To overcome the shortcomings of these results, a
permutation test based on Fisher's randomization test is
developed to provide a practical basis for assessing the
statistical significance of a regression in such situations.

An investigation is made into the distributional
properties of the multiple correlation coefficient in the
choice of a transformation of the dependent variable, y.
The study investigates the possibility that pedestrian use
of transformations, such as $y^* = (y+c)^p$, may lead to an
inflationary effect on the sample correlation.

A  practical  management  science  application  of  the 
statistical  procedures  developed  in  this  study  is  explored 
in  the  area  of  parametric  cost  estimation. 


TABLE  OF  CONTENTS 


TITLE PAGE

ABSTRACT

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER

I. INTRODUCTION

II. APPROXIMATIONS OF THE DISTRIBUTION OF $R^2$ IN BEST SUBSET REGRESSION

III. SIGNIFICANCE TESTS AND TESTS OF MODELS IN SUBSET REGRESSION

IV. POWER TRANSFORMATIONS OF BIVARIATE SAMPLES

V. APPLICATIONS OF PERMUTATION TEST

VI. CONCLUSIONS

BIBLIOGRAPHY


LIST  OF  TABLES 


Table

I. Percentage Points of $r^2(8,4)$ and the Gamma Density

II. Percentage Points of $r^2(8,4)$, Gamma and $(n-1)r_n^2(8,4)$

III. Fraction Rejected at .05 Significance Level

IV. Fraction Rejected at .05 Significance Level for $\rho^2 = .4$ - Additional Data

V. Values from Simulation Run

VI. Cost Data for Subsystem A

VII. Cost Data for Subsystem B


LIST  OF  FIGURES 


Figure

1. Relative Frequency of $r^2(8,4)$

2. Relative Histogram of $r^2(8,4)$ and Gamma Density

3. Relative Frequency of $9\, r_{10}^2(8,4)$

4. Relative Frequency of $24\, r_{25}^2(8,4)$

5. Relative Frequency of $49\, r_{50}^2(8,4)$

6. Relative Frequency of $99\, r_{100}^2(8,4)$

7. Plot of $R^2(i)$ in the Set $\mathcal{R}$

8. The Bulging Rule

9. Scatter Plot of 30 Standardized Values of R" vs R

10. Scatter Plot of 30 Absolute Standardized Values of Residuals vs R

11. Scatter Plot of 10 Standardized Values of Y* vs X

12. Scatter Plot of 10 Standardized Values of Residuals vs X

13. Scatter Plot of 10 Standardized Values Y vs X

14. Scatter Plot of 10 Standardized Values Y vs X

15. Stem and Leaf of $R^2$ Values for Subsystem A

16. Stem and Leaf of $R^2$ Values for One-variable Models

17. Stem and Leaf of $R^2$ Values for Two-variable Models


CHAPTER  I 


INTRODUCTION 

The most widely used and abused data analytic methodology
is regression analysis (4). Many books, notably (7),
(9) and (15), and thousands of research papers attest to the
popularity and importance of these powerful statistical
procedures. The advent of high-speed digital computers and
associated statistical software packages has made regression
analysis accessible to users in all fields of research. In
particular, the new technological developments in
time-shared computing literally bring these and other procedures
into the manager's office, providing the means for assessing
decision alternatives at a moment's notice. Sophisticated
techniques, now routinely applied, were impractical only 20
years ago because of enormous computational requirements.

For some methodologies, in particular empirical model
building, statistical theory is not keeping pace with
ever-expanding computational capabilities in the sense that data
analysts are developing and using algorithms which lead to
results whose statistical properties are not fully
understood. This statement is not intended as a criticism of
exploratory data analysis per se, but it does identify an
area of practical significance whose theoretical foundation
is shaky at best.

Unlike confirmatory statistical techniques (such as
hypothesis testing), wherein inferences are made within the
framework of a given model, the term "empirical model
building" is used to describe the process of "letting the
data speak for itself." In searching for possible
relationships among a collection of variables, the data analyst may
allow the sample data to answer such questions as, "Which
variables should be included in the model?" and "What model
structures should be contemplated?"

The purpose of this dissertation is to explore the
distributional properties of commonly used statistics developed
in the course of empirical model building. Theoretical
results are obtained in certain tractable cases. Simulation
is employed to develop insight in those situations where
explicit mathematical results have been elusive.

It is well known that the use of empirical variable
selection techniques in multiple regression leads to
inflated values of the coefficient of determination, $R^2$.
The degree of this inflation is not well understood. What
makes the problem difficult is the fact that the distribution
of $R^2$ depends not only on the underlying relationship
among the variables, but on the data analytic tools used to
develop the model. Attempts have been made to obtain
approximations and asymptotic results for special cases of
this problem. Chapter II reviews some of the more
noteworthy efforts to investigate the distribution of $R^2$ in best
subset regression. Best subset regression is concerned with
the problem of determining the subset of size k out of p
candidate predictor variables which maximizes some function
of $R^2$, where k itself may be data-dependent. Special
attention is given to an asymptotic result of Alam and
Wallenius (2). A proof of their result is provided. Since
their asymptotic distribution of $R^2$ is derived by allowing
the sample size to grow large, an investigation is performed
to determine an appropriate sample size for an adequate
approximation.

The approximations alluded to above provide insight but
little help of a practical nature in testing for statistical
significance of the sample $R^2$ resulting from data analytic
selection techniques. Chapter III addresses this problem.
The shortcomings of the classical statistical tests for
these situations are reviewed. In order to overcome these
limitations, a new approach is introduced which yields an
exact test conditioned on the sample data and selection
technique. This test is most useful in situations where
the number of observations is small compared to the number
of candidate predictor variables. In particular, this test
is valid if the number of potential predictor variables
exceeds the available degrees of freedom. The classical F
test cannot be used in this case. Determining the power of
this test is a difficult problem and remains unsolved. A
simulation is used to compare the power of the new test to
that of the classical F test for several cases where the
latter is valid.

Chapter IV deals with the distribution of the multiple
correlation coefficient in multiple regression when the data
are used to determine the choice of a transformation of the
dependent variable y. The family of transformations
considered is of the form $y^* = (y+c)^p$. This is a widely used
family of transformations of practical importance. It is
often used, as Tukey (19) puts it, "to remove apparent ills
from the data ... aiding in the analysis by bending the data
nearer the Procrustean bed of the assumptions underlying
conventional analysis." The data are employed to determine c
and p in such a way that the relationship between $y^*$ and a
single predictor x is more nearly linear than that between y
and x. The study investigates the possibility that
pedestrian use of this transformation may lead to an inflationary
effect on the sample correlation. The results indicate some
interesting phenomena which are illustrated in examples and
lead to a theorem.

Chapter V explores a practical management science
application of the statistical procedures developed in this
study. A problem often faced by costing and pricing
analysts involves estimating the cost of a proposed system.
One approach to this problem is independent parametric cost
estimation. A description of this method, together with its
advantages and disadvantages, is given. Actual cost and performance
data obtained from the Navy Weapons Center, China Lake,
California, are analyzed using the methodology of
Chapter III.

Chapter  VI  contains  a  discussion  of  the  inherent 
difficulties  of  empirical  model  building  and  identifies 
some  areas  in  which  further  research  is  required. 


CHAPTER  II 


APPROXIMATIONS OF THE DISTRIBUTION OF $R^2$
IN BEST SUBSET REGRESSION

In recent years a great deal of interest has been
expressed in the distributional properties of $R^2$ and other
statistical measures of fit for regression models when
variable selection techniques are employed. Historically,
Fisher (10) derived the general sampling distribution of $R^2$
when sampling from a multivariate normal distribution. The
distribution theory for the sample $R^2$ statistic in empirical
model building is quite complex. The difficulty stems from
the fact that the distribution depends not only on the
underlying relationship among the variables but also on the
variable selection criterion.

This chapter reviews some interesting approximating
formulas and asymptotic results for the distribution of $R^2$
in best subset regression. Here the term "best subset
regression" refers to the following situation. The data
analyst has a set of n independent observations on p
candidate predictor variables and one dependent variable. The
goal of the analysis is to determine the k-variable
regression equation which maximizes the sample coefficient
of determination for various values of k. The difficulty
with this analysis is assessing the statistical significance
of $R^2$ for a given value of k.


Diehr and Hoflin (8) utilize a Monte-Carlo approach to
devise a function purported to estimate the distribution of
the sample $R^2$ in best subset regression for samples selected
from a (p+1)-dimensional multivariate normal population with
zero mean and identity covariance matrix. Monte-Carlo
estimates of the $(1-\alpha)$ percentile points, $R^2(k,p,n,\alpha)$, of
the sample distribution are obtained for selected values of
k, p, and n. This is accomplished by generating 100 samples
of size n from the null distribution and determining and
saving the maximum $R^2$ associated with the best k-variable
regression equation for k from 1 to p. The set of 100 $R^2$
values corresponding to a particular collection of k, p, and
n values is ordered to give estimates of the percentage
points. By visually examining some of the Monte-Carlo
results, the authors note that a function of the form

\[ R^2(k,p,n,\alpha) = w(1 - v^k) \]

seems to provide a reasonable fit of the simulation results
when w and v are determined from the known boundary values
$R^2(1,p,n,\alpha)$ and $R^2(p,p,n,\alpha)$. The authors suggest that a
statistical test based on this formula is an improvement
over the standard tests for the empirical researcher since
the number of independent variables which has been searched
is taken into account. This test, while more appropriate
than the F test in spirit at least, would serve only to give
insight into the results. The nominal "significance level"
is somewhat suspect since the percentile points are based
on an ad hoc fit of a Monte-Carlo distribution.
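As an illustration of how w and v can be recovered from the two boundary values, the following is a minimal sketch, not Diehr and Hoflin's own procedure. It assumes $0 < v < 1$, so that the ratio $R^2(p,p,n,\alpha)/R^2(1,p,n,\alpha) = 1 + v + \cdots + v^{p-1}$ is increasing in v and can be inverted by bisection. The function names and the numerical inputs are hypothetical.

```python
def fit_w_v(r2_one, r2_full, p, tol=1e-12):
    """Solve w*(1 - v) = r2_one and w*(1 - v**p) = r2_full for 0 < v < 1.

    r2_one and r2_full stand for the boundary percentile points at k = 1
    and k = p; the bisection solver is an assumption of this sketch.
    """
    target = r2_full / r2_one  # equals 1 + v + ... + v**(p-1)
    if not 1.0 < target < p:
        raise ValueError("boundary values are inconsistent with 0 < v < 1")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        geometric_sum = sum(mid ** j for j in range(p))
        if geometric_sum < target:
            lo = mid
        else:
            hi = mid
    v = (lo + hi) / 2.0
    w = r2_one / (1.0 - v)
    return w, v


def approximate_percentile(k, w, v):
    """Evaluate the fitted form w*(1 - v**k) for a subset size k."""
    return w * (1.0 - v ** k)
```

With the fitted pair in hand, `approximate_percentile(k, w, v)` interpolates the percentile point for intermediate subset sizes 1 < k < p.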

Rencher and Pun Fu-Ceayong (16) extend the results of
Diehr and Hoflin by computing the mean of the inflated $R^2$
under best subset selection, allowing for correlated
predictor variables, and including the situation where the
number of candidate predictor variables exceeds the number
of observations.

As expected, their Monte-Carlo study indicates that
the inflation of $R^2$ is somewhat less when the predictor
variables are intercorrelated. To supplement Monte-Carlo
estimates for the mean and percentage points of the
distribution of $R^2$ under selection, the authors obtain asymptotic
approximations for these parameters. For a k-variable
model without selection, $R^2$ has a beta distribution in the
null case (10). Thus, the distribution function of $R^2$ is
given by

\[ F_{a,b}(R^2) = \beta(R^2; a,b) / \beta(1; a,b) \]

where, for $0 < x < 1$,

\[ \beta(x; a,b) = \int_0^x t^{a-1}(1-t)^{b-1}\,dt, \]

$a = k/2$ and $b = (n-k-1)/2$.

The number of possible k-variable prediction equations
is $N = p!/[k!(p-k)!]$. By assuming the corresponding N values
of $R^2$ are independent, the authors obtain asymptotic
formulas for the parameters based on the beta function as N
becomes large. An upward bias results when these formulas
are applied to the Monte-Carlo data. The authors attribute
this bias to the assumption of independence noted above.
The formulas are modified to correct for this bias by
adjusting the value of N via a function of the form
$(\ln N)^{cN^d}$, where c and d are empirically determined from the
Monte-Carlo results. Their final approximating formulas
for the mean and $\gamma$-th percentile of the distribution of $R^2$ are,
respectively,

\[ E(R^2) = 1 - F_{b,a}^{-1}\left[ 1/(\ln N)^{1.8N^{.04}} \right] \Gamma(1 + 1/w) \]

and

\[ R_\gamma^2 = F_{a,b}^{-1}\left[ 1 + \ln \gamma/(\ln N)^{1.8N^{.04}} \right] \]

where $w = (n-k-1)/2$.

It is suggested that these formulas can be used as
possible guidelines for assessing the significance of $R^2$
values obtained in best subset regression applications.
However, the empirical researcher might feel that his
confidence in their use is overshadowed by their ominous
appearance and computational complexity.

Zirphile (23) derives an asymptotic approximation for
the $(1-\alpha)$ percentiles of the distribution of $R^2$ in best
subset regression, as the sample size n is made large, using
extreme value theory. This approximating formula gives
percentage points which are as large as 1.5 for some small
values of n (16), although $R^2$ cannot exceed 1. This poor
performance may be due in part
to the fact that his results are based on the assumption
that under the null hypothesis the asymptotic distribution
of $R^2$ for a k-variable equation without selection is normal.
The actual distribution does not tend to normality for large
n under the null hypothesis (10) so that Zirphile's results
are of dubious value.

Alam and Wallenius (2) derive a very interesting
asymptotic result in the following

Theorem: Let $Z' = (Y, X_1, X_2, \ldots, X_p)$ have a (p+1)-variate
normal distribution with arbitrary mean vector $\mu$ and
diagonal covariance matrix $\Sigma$. Given a sample of size n on Z,
let $r^2_{i_1, i_2, \ldots, i_k}$ denote the sample multiple correlation
coefficient between y and the k predictor variables
$x_{i_1}, x_{i_2}, \ldots, x_{i_k}$, where $1 \le k \le p$. Let $r_n^2(p,k)$ denote the maximum
of all $\binom{p}{k}$ values of $r^2_{i_1, i_2, \ldots, i_k}$. Then as n tends to
infinity, $(n-1)\, r_n^2(p,k)$ converges (with probability 1) to
a random variable $r^2(p,k)$ distributed as the sum of the k
largest order statistics of a random sample of size p from
a chi-square distribution with one degree of freedom.

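Proof: Let $X = (X_1, X_2, \ldots, X_{p+1})' \sim MVN(\mu, \Sigma)$, where $X_1$
plays the role of Y, and assume $\Sigma$ is of the form

\[ \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad \sigma_{21} = 0, \quad \Sigma_{22} = \mathrm{diag}(\sigma_{22}, \ldots, \sigma_{p+1,p+1}). \]

We may assume, without loss of generality, that $\mu = 0$.
Let $x_1, x_2, \ldots, x_n$ be a random sample of size n on X. Let

\[ A = \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})' \]

where

\[ \bar{x} = \frac{1}{n} \sum_{j=1}^{n} x_j . \]

Next, partition A as

\[ A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & A_{22} \end{pmatrix} \]

and let $y' = (y_1, y_2, \ldots, y_n)$ be the first component of
each sample observation $x_j$, $j = 1, 2, \ldots, n$. Consider the
conditional distribution of

\[ a_{21} = (a_{21}, a_{31}, \ldots, a_{i1}, \ldots, a_{(p+1)1})' \]

given $Y = y$.

The matrix

\[ X = \begin{pmatrix} y_1 & y_2 & \cdots & y_j & \cdots & y_n \\ x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots & & \vdots \\ x_{(p+1)1} & x_{(p+1)2} & \cdots & x_{(p+1)j} & \cdots & x_{(p+1)n} \end{pmatrix} \]

is column-wise independent and the predictor part of the jth
column is conditionally

\[ N(\sigma_{21} \sigma_{11}^{-1} y_j, \; \Sigma_{22} - \sigma_{21} \sigma_{11}^{-1} \sigma_{12}). \]

However, since we are assuming $\sigma_{21} = 0$, this conditional
distribution is $N(0, \Sigma_{22})$.

Since $a_{i1} = \sum_{j=1}^{n} (y_j - \bar{y})\, x_{ij}$, it follows that

i) $E(a_{i1} \mid Y = y) = 0$.

ii) $\mathrm{cov}(a_{i1}, a_{k1} \mid y) = \sum_{j=1}^{n} \sum_{l=1}^{n} (y_j - \bar{y})(y_l - \bar{y})\, \mathrm{cov}(x_{ij}, x_{kl} \mid y)$.

Note that $\mathrm{cov}(x_{ij}, x_{kl} \mid y) = 0$ unless $j = l$. Thus

\[ \mathrm{cov}(a_{i1}, a_{k1} \mid y) = \sum_{j=1}^{n} (y_j - \bar{y})^2\, \mathrm{cov}(x_{ij}, x_{kj} \mid y) = a_{11} \sigma_{ik} . \]

So, given $Y = y$, $a_{21} \sim N(0, a_{11} \Sigma_{22})$. Recall that

\[ R^2 = \frac{a_{12} A_{22}^{-1} a_{21}}{a_{11}} = \frac{a_{12} \Sigma_{22}^{-1/2}\, \Sigma_{22}^{1/2} A_{22}^{-1} \Sigma_{22}^{1/2}\, \Sigma_{22}^{-1/2} a_{21}}{a_{11}} . \]

Let $z = \Sigma_{22}^{-1/2} a_{21} / \sqrt{a_{11}}$. Then, given $Y = y$,

1. $z \sim N(0, I)$

2. $R^2 = z' \Sigma_{22}^{1/2} A_{22}^{-1} \Sigma_{22}^{1/2} z$.

Thus, $(n-1)R^2 = z' \Sigma_{22}^{1/2} \left( \frac{1}{n-1} A_{22} \right)^{-1} \Sigma_{22}^{1/2} z$.

Since $\left( \frac{1}{n-1} A_{22} \right)^{-1} \to \Sigma_{22}^{-1}$ almost surely, we have for large n,
given $Y = y$,

\[ (n-1)R^2 \approx z'z = \sum_{i=2}^{p+1} z_i^2 \sim \chi^2(p) \quad \forall y . \]

That is, the asymptotic conditional distribution does not
depend on the conditioning value of Y. Therefore, the
distribution of $(n-1)R^2$, for large n, is chi-square with
p degrees of freedom.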

Suppose we wish to consider all $\binom{p}{k}$ subsets of size k
from the set $\{x_2, x_3, \ldots, x_{p+1}\}$ of predictor variables and
compute the sample multiple correlation coefficient between
y and each such subset. Let $R_{i_1, i_2, \ldots, i_k}$ denote the
multiple correlation between y and the set $\{x_{i_1}, x_{i_2}, \ldots, x_{i_k}\}$
and let

\[ R(k) = \max_{\{i_1, i_2, \ldots, i_k\}} R_{i_1, i_2, \ldots, i_k} . \]

From above, we have for large n

\[ (n-1)R^2_{2 \ldots (p+1)} \approx z_2^2 + z_3^2 + \cdots + z_{p+1}^2 \]

where $z_i \sim N(0,1)$.

Thus

\[ (n-1)R^2_{2 \ldots (p+1)} \approx \frac{1}{a_{11}} \left( \frac{a_{21}^2}{\sigma_{22}} + \frac{a_{31}^2}{\sigma_{33}} + \cdots + \frac{a_{(p+1)1}^2}{\sigma_{p+1,p+1}} \right) \]

for large n. More generally, we see that

\[ (n-1)R^2_{i_1 i_2 \ldots i_k} \approx \frac{1}{a_{11}} \left( \frac{a_{i_1 1}^2}{\sigma_{i_1 i_1}} + \frac{a_{i_2 1}^2}{\sigma_{i_2 i_2}} + \cdots + \frac{a_{i_k 1}^2}{\sigma_{i_k i_k}} \right) . \]

Note that $\{ a_{j1}^2 / (a_{11} \sigma_{jj}) : j = 2, 3, \ldots, p+1 \}$ represents a random
sample of size p from a $\chi^2(1)$. Thus $(n-1)R^2(k)$ is the sum
of the k largest values of $a_{j1}^2 / (a_{11} \sigma_{jj})$, or the sum of the k
largest order statistics of a random sample of size p
from a chi-square distribution with one degree of
freedom. Q.E.D.

This theorem is of particular interest since it
provides information about the degree of inflation of the F
statistic in best subset regression as a corollary. Under
the null hypothesis for a particular k-variable model, the
F statistic is distributed as a constant times the
quotient of independent $\chi^2$ random variables, that is

\[ F = \frac{\chi^2(k)/k}{\chi^2(n-k-1)/(n-k-1)} . \]

For large n, by Theorem 20.6 of (6) the denominator
converges in probability to 1. Thus, F converges in
distribution to a random variable distributed as a $\chi^2(k)/k$ or,
equivalently, as the average of k independent observations
from a chi-square distribution with one degree of freedom.

If the best subset of size k out of p predictors is
selected, the associated F statistic is given by

\[ F_{max} = \frac{r_n^2(p,k)}{1 - r_n^2(p,k)} \cdot \frac{n-1-k}{k} = \frac{1}{k} \cdot \frac{(n-1-k)\, r_n^2(p,k)}{1 - r_n^2(p,k)} . \]

Since $r_n^2(p,k)$ converges to zero as a direct result of
the Alam-Wallenius theorem, $F_{max}$ converges in distribution
to $\frac{1}{k} r^2(p,k)$ which is distributed as $\frac{1}{k} \sum_{i=1}^{k} \chi^2_{[p-i+1]}$, the
average of the k largest of p independent observations from
a chi-square distribution with one degree of freedom. This
comparison of F and $F_{max}$ gives the clearest picture of the
nature of the inflation of $R^2$ in best subset regression.
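The contrast between the two limiting distributions can be seen numerically. The sketch below is an illustration only: it draws p squared standard normal values per replication and compares the average of a fixed set of k of them (the limit of F for a pre-specified model) with the average of the k largest (the limit of $F_{max}$). The function name, the seed, and the default replication count are hypothetical choices.

```python
import random


def mean_limiting_f(p, k, reps=5000, seed=0):
    """Monte-Carlo means of the two limiting statistics.

    Returns (mean of chi2(k)/k for a fixed subset,
             mean of the average of the k largest of p chi2(1) draws).
    """
    rng = random.Random(seed)
    fixed_total = 0.0
    best_total = 0.0
    for _ in range(reps):
        # Squares of standard normals are chi-square with one degree of freedom.
        draws = [rng.gauss(0.0, 1.0) ** 2 for _ in range(p)]
        fixed_total += sum(draws[:k]) / k          # subset chosen in advance
        best_total += sum(sorted(draws)[-k:]) / k  # subset chosen after the fact
    return fixed_total / reps, best_total / reps
```

For p = 8 and k = 4 the first mean should sit near 1, the expected value of $\chi^2(k)/k$, while the second is visibly larger.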

The sample size necessary for an adequate approximation
by an asymptotic methodology is always of prime
concern. An investigation into this question is made by
means of a Monte-Carlo approach for the Alam-Wallenius
result. A simulation is performed in a straightforward
manner to compare the distributions of the two statistics
involved. A random sample of size p is selected from a
chi-square distribution with one degree of freedom. The
k largest values in the sample are added together to yield
one observation on the statistic $r^2(p,k)$. This procedure
is repeated 1000 times, and a relative frequency histogram
is developed. Figure 1 shows such a histogram for
$r^2(8,4)$.
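The simulation just described can be sketched in a few lines. This is a minimal reconstruction under stated assumptions, not the original program: chi-square(1) values are generated as squared standard normal draws, and the function names, seed, and bin width are hypothetical.

```python
import random


def sum_of_k_largest_chi2(p, k, rng):
    """One observation on r^2(p,k): the sum of the k largest of p chi2(1) draws."""
    draws = [rng.gauss(0.0, 1.0) ** 2 for _ in range(p)]
    return sum(sorted(draws)[-k:])


def simulate_limit_statistic(p=8, k=4, reps=1000, seed=1):
    """Repeat the draw `reps` times, as in the 1000-replication study."""
    rng = random.Random(seed)
    return [sum_of_k_largest_chi2(p, k, rng) for _ in range(reps)]


def relative_frequencies(values, bin_width=1.0):
    """Relative frequency of each bin [left, left + bin_width)."""
    counts = {}
    for value in values:
        left = int(value // bin_width) * bin_width
        counts[left] = counts.get(left, 0) + 1
    total = len(values)
    return {left: count / total for left, count in sorted(counts.items())}
```

Calling `relative_frequencies(simulate_limit_statistic())` gives the bin heights for a histogram of the kind shown in Figure 1.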

Alam and Wallenius (3) show that the distribution
function of the statistic $r^2(p,k)$ can be expressed as an
infinite linear combination of gamma distribution functions.
The shape of Figure 1 resembles the gamma density. For
these reasons, an attempt is made to fit a gamma density to
the simulation results. Recall that the gamma density is a
two-parameter function which may be written as

\[ f(x) = \begin{cases} \dfrac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\, \Gamma(\alpha)}, & x > 0 \\ 0, & \text{otherwise.} \end{cases} \]

The method of moments is applied to the data presented
in Figure 1 to obtain estimates of $\alpha$ and $\beta$: $\hat{\alpha} = 3.6$ and
$\hat{\beta} = 2.0$. Figure 2 depicts the histogram of $r^2(8,4)$ with
this gamma density superimposed. Table I gives percentage
points for the sample data and this gamma density.
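For the gamma density written above, the mean is $\alpha\beta$ and the variance is $\alpha\beta^2$, so the method of moments gives $\hat{\alpha} = \bar{x}^2/s^2$ and $\hat{\beta} = s^2/\bar{x}$. The following is a small sketch of that calculation. The function name is hypothetical, and the use of the sample variance with divisor n-1 is an assumption, since the text does not say which divisor was used.

```python
import statistics


def gamma_method_of_moments(sample):
    """Method-of-moments estimates (alpha_hat, beta_hat) for a gamma density.

    Matches the sample mean to alpha*beta and the sample variance
    (divisor n-1, an assumption of this sketch) to alpha*beta**2.
    """
    mean = statistics.mean(sample)
    variance = statistics.variance(sample)
    alpha_hat = mean * mean / variance
    beta_hat = variance / mean
    return alpha_hat, beta_hat
```

The reported estimates $\hat{\alpha} = 3.6$ and $\hat{\beta} = 2.0$ correspond to a sample mean of about 7.2 and a sample variance of about 14.4.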


The distribution of $r^2(8,4)$ appears very similar to
that of a random variable distributed as a gamma with
$\alpha = 3.6$ and $\beta = 2$. It appears that the infinite sum
mentioned above may be dominated by a single gamma
distribution function.

To obtain an empirical distribution for the statistic
$(n-1)r_n^2(p,k)$, a random sample of size n is selected from a
(p+1)-dimensional multivariate normal population with zero
mean and identity covariance matrix. The best k-variable
regression equation is then determined by use of an
efficient search of the $\binom{p}{k}$ possible regressions, and the
resulting value of $(n-1)r_n^2(p,k)$ is saved. This process is
repeated 500 times resulting in relative frequency
distributions of $(n-1)r_n^2(p,k)$. For p = 8 and k = 4, Figures
3-6 depict histograms of these frequencies for n = 10, 25,
50 and 100, respectively, with the superimposed density
of the gamma distribution.
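A minimal sketch of this finite-sample simulation follows. It is not the original program: it replaces the efficient search with plain enumeration of all subsets, solves the normal equations by Gaussian elimination, and assumes the selected predictors are not collinear (true with probability 1 for independent normal draws). All function names and the seed are hypothetical.

```python
import itertools
import random


def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def solve(matrix, rhs):
    """Solve matrix * b = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for j in range(col, k + 1):
                aug[r][j] -= factor * aug[col][j]
    b = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * b[j] for j in range(i + 1, k))
        b[i] = (aug[i][k] - tail) / aug[i][i]
    return b


def r_squared(y, xs):
    """R^2 of the least squares regression (with intercept) of y on xs."""
    yc = centered(y)
    xc = [centered(x) for x in xs]
    gram = [[sum(a * b for a, b in zip(u, v)) for v in xc] for u in xc]
    cross = [sum(a * b for a, b in zip(u, yc)) for u in xc]
    coef = solve(gram, cross)
    explained = sum(b * c for b, c in zip(coef, cross))
    return explained / sum(a * a for a in yc)


def best_subset_r2(y, predictors, k):
    """Largest R^2 over all subsets of k predictors (plain enumeration)."""
    subsets = itertools.combinations(range(len(predictors)), k)
    return max(r_squared(y, [predictors[i] for i in s]) for s in subsets)


def simulate_scaled_r2(n, p, k, reps, seed=2):
    """Draws of (n-1) * r_n^2(p,k) under independent standard normal data."""
    rng = random.Random(seed)
    results = []
    for _ in range(reps):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
        results.append((n - 1) * best_subset_r2(y, xs, k))
    return results
```

For example, `simulate_scaled_r2(n=25, p=8, k=4, reps=500)` mirrors one of the settings described above. Pure-Python enumeration is slow for large n, so smaller settings are advisable for a quick check.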

The results of this simulation offer no definitive
answer to the question of appropriate sample size. However,
it is possible to form some conclusions after visually
examining the histograms. In a hypothesis testing
framework, the right tail of the distributions will be important
in the decision making process. The statistic $r^2(p,k)$
seems to overestimate the probability in the right tail
for small values of n as can be seen in Table II. A
statistical test based on this distribution would appear to be a
conservative test for small values of n in that the actual
significance level is less than the nominal level.

TABLE II. Percentage Points of $r^2(8,4)$, Gamma and $(n-1)r_n^2(8,4)$

Statistic                  90%      95%      99%

$r^2(8,4)$                 12.15    14.18    18.91
gamma                      12.29    14.36    18.80
$9\, r_{10}^2(8,4)$         7.85     8.14     8.69
$24\, r_{25}^2(8,4)$        9.60    10.54    13.13
$49\, r_{50}^2(8,4)$       10.39    11.82    15.46
$99\, r_{100}^2(8,4)$      10.41    11.91    15.68

The approximations presented in this chapter show the
approaches that have been used to explore the distributional
properties of $R^2$ under best subset regression. These
results provide a better understanding but afford little
help of a practical nature in testing for statistical
significance of the sample $R^2$ resulting from data analytic
selection techniques. In the next chapter, a new and exact
method to deal with this important problem is developed.


Figure 1: Relative Frequency of $r^2(8,4)$

Figure 2: Relative Histogram of $r^2(8,4)$ and Gamma Density

Figure 5: Relative Frequency of $49\, r_{50}^2(8,4)$


CHAPTER  III 


SIGNIFICANCE  TESTS  AND  TESTS  OF  MODELS 
IN  SUBSET  REGRESSION 

In Chapter II some asymptotic approximations for the
null distribution of $R^2$ in the case where the predictor
variables are orthogonal were discussed. In this chapter,
the focus is on exact statistical tests for the finite
sample size case and the result is generalized to include
nonorthogonal predictor variables. These results will
provide a practical basis for assessing the statistical
significance of a regression developed by any empirical
selection method.

Theoretical considerations often suggest the important
independent variables and the functional form of the
relationship. However, models based on theoretical considerations
are the exception rather than the rule in most practical
managerial problems. As a result, the set of possible
predictors may be quite large and the problem of selecting
a "best" set becomes a difficult task. There are a number
of articles in the literature, notably (11), (17) and (18),
describing this problem and offering various criteria to be
used to determine the variables to be included. Lindley (13)
emphasizes that the selection criterion should be related to
the intended use of the model. Hocking (11) gives a
description of these potential uses which include
description, prediction, and control. It is generally
recognized that a universally best criterion for selecting a set
of predictor variables does not exist. It is not the purpose
of this chapter to discuss the advantages and disadvantages
of various selection procedures, but to determine a way of
evaluating the statistical significance of the resulting
model.

A commonly used selection procedure involves
determining the adjusted multiple coefficient of determination,
$R_a^2(k)$, for all $2^p - 1$ subsets of the predictor variables
where

\[ R_a^2(k) = 1 - \frac{n-1}{n-k-1} \left[ 1 - R^2(k) \right] \]

and $R^2(k)$ is the coefficient of determination for the model
with k predictor variables. The subset chosen is that with
maximal $R_a^2(k)$. In fact, $R^2(k)$ plays a central role in
almost all selection criteria. This value of $R^2(k)$, as
previously shown, can be misleadingly large. How large
must it be to be judged statistically significant? What
makes the question hard to answer is the fact that the
distribution of $R^2(k)$ for the selected model depends on the
underlying relation among the variables as well as the
selection criterion.

If p variables are being considered and all are
included in the model, the classical F test is appropriate
provided p < n-1. The test of

\[ H_0: \rho^2_{y \cdot 12 \ldots p} = 0 \]

is equivalent to testing

\[ H_0: \rho_1 = \rho_2 = \cdots = \rho_p = 0 \]

where $\rho_i$ is the simple correlation between the dependent
variable, y, and the ith predictor, $x_i$. The null hypothesis
is rejected if

\[ F = \frac{R^2/p}{(1-R^2)/(n-p-1)} > F(\alpha, p, n-p-1) . \]
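The test statistic in the rejection rule above is simple to compute from $R^2$, n, and p. The sketch below shows only that arithmetic. The function name is hypothetical, and the critical value $F(\alpha, p, n-p-1)$ must still be looked up in a table or obtained from statistical software.

```python
def overall_f_statistic(r2, n, p):
    """F = (R^2 / p) / ((1 - R^2) / (n - p - 1)) for the full p-variable model."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    if p >= n - 1:
        # No residual degrees of freedom remain, so the F test is unavailable.
        raise ValueError("the F test requires p < n - 1")
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))
```

For instance, $R^2 = .5$ with n = 12 and p = 1 gives F = (.5/1)/(.5/10) = 10.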

If $R^2$ is significant for the full model, a reduction
in the number of predictors used is usually called for
since a model with many independent variables is expensive
to maintain, difficult to analyze and interpret, and
almost always results in larger predictor variances (21)
than a suitably selected submodel. The application of a
selection procedure to obtain a "best" submodel may result
in a submodel which is no longer statistically significant.
Cramer (5) suggests that it is possible for the value of
$R^2$ for the full model to be statistically significant
while none of the regression coefficients have individually
significant t values. In this situation, each predictor is
making its own independent, albeit slight, contribution so
that the total effect is statistically significant. It may
not be possible to eliminate any variable or set of
variables so as to maintain a significant $R^2$.

The F test cannot be used if the number of predictor
variables is larger than n-2. Common sense and sound
statistical practice require selection of a subset of
predictor variables. When this selection is done empirically,
as we have seen, $R^2$ (and hence F) becomes inflated so that
the standard tests are invalid. Even though the
distributional properties of $R^2$ under variable selection are
hopelessly complex, an exact conditional test is derived
which is valid for any variable selection technique.

Consider the hypothesis $H_0: \rho^2 = 0$. Under $H_0$,
the joint distribution of X and Y is invariant under
permutations of Y. Since there are n observations, there are
N = n! possible permutations of the y values. If the
particular variable selection method being used is applied to
each of these permutations, a set of N corresponding values
of $R^2$ could, in principle, be generated. Let $R^2(i)$,
i = 1, 2, ..., N, be the ith smallest value in this set of N
values, and let $\mathcal{R}$ denote the collection of order
statistics so obtained. Let $R_0^2$ denote the value of $R^2$
associated with the unpermuted y values. Then
$R_0^2 \in \mathcal{R}$ and, by invariance,
$\mathrm{Prob}(R_0^2 > R^2(N-m)) = m/N$ for $1 \le m < N$, so
that the critical region $R_0^2 > R^2(N-m)$ yields an exact
level $\alpha = m/N$ test of $H_0$. This test is a special
case of Fisher's randomization test. The power of this test
will be discussed later in this chapter.
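The exact test described above can be sketched for very small n by enumerating all n! permutations. The sketch below is only an illustration under stated assumptions: the selection rule (`selected_r2`, which keeps the single predictor with the largest squared correlation) is a hypothetical stand-in for whatever variable selection method is actually in use, and the data are made up. The returned p-value is the fraction of permutations whose selected $R^2$ is at least as large as the observed one, so rejecting when it is at most m/N corresponds to the critical region given in the text.

```python
from itertools import permutations

def r_squared(x, y):
    """Squared sample correlation between x and y (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def selected_r2(columns, y):
    """Hypothetical selection rule: R^2 of the best single predictor."""
    return max(r_squared(col, y) for col in columns)

def exact_permutation_p_value(columns, y):
    """Apply the selection rule to every permutation of y and locate the observed R^2."""
    observed = selected_r2(columns, y)
    values = [selected_r2(columns, perm) for perm in permutations(y)]
    # Count permutations at least as extreme as the unpermuted data
    # (small tolerance so the identity permutation always counts).
    count = sum(1 for v in values if v >= observed - 1e-12)
    return observed, count / len(values)

# Made-up data: n = 4 observations, p = 2 candidate predictors.
x1 = [0.3, -1.2, 0.8, 1.5]
x2 = [1.1, 0.4, -0.7, 2.0]
y = [0.9, -0.5, -1.1, 2.3]
observed, p_value = exact_permutation_p_value([x1, x2], y)
print(f"observed R^2 = {observed:.3f}, exact p-value = {p_value:.3f}")
```

With n = 4 there are only 24 permutations, so the smallest attainable p-value is 1/24; this is one reason the exact test has little resolution for tiny samples.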

The following example will help illustrate the methodology.
Consider the regression model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

where (Y, X1, X2)' is distributed as MVN(0, I). There are
n = 4 observations and p = 2 possible predictors. All
possible regressions are calculated. The maximum adjusted
$R^2$ criterion leads to the prediction equation

$$\hat{y} = -2.09 + 3.19\,x_2$$

with $R^2 = .92$. Using this criterion to select a "best"
subset for each of the 24 possible permutations of the y
values yields the values of $R^2$ shown in Figure 7. Note that
$R_0^2 = R^2(15)$; that is, $R_0^2$ is the 15th order
statistic. If a value of $R^2$ is chosen at random from the
set $\mathcal{R}$,

$$\mathrm{Prob}(R^2 > R_0^2 \mid H_0 \text{ true}) = 9/24.$$

This result is compatible with $H_0$ and gives little evidence
to indicate that $H_0$ is false.


Figure 7: Plot of $R^2(i)$ in the Set $\mathcal{R}$


The statistical test illustrated in the above example
is an exact test. Unfortunately, this test is not practical
because of the large number of permutations that are
required even for small values of n. For a sample
of n = 25 observations, N would be over $10^{25}$. The
calculations associated with this number of permutations make
this approach computationally infeasible.
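The size of N for n = 25 can be checked directly. A minimal sketch using only the standard library:

```python
import math

n = 25
N = math.factorial(n)            # number of permutations of the y values
exponent = len(str(N)) - 1       # order of magnitude of N
print(f"{n}! = {N}  (about 10^{exponent})")
```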

Since it is not practical to determine the entire set
$\mathcal{R}$, sampling schemes will be explored. Note that
only the relative position of $R_0^2$ in $\mathcal{R}$ is
needed in order to measure the probability of a Type I error
for the conditional test. As a first approach to assessing
the extremeness of $R_0^2$ relative to $\mathcal{R}$ we
consider a nonparametric tolerance interval argument (22).
Based on a random sample of s permutations, where

$$s \ge \frac{\ln(1-g)}{\ln(1-d)},$$

it is known that

$$\mathrm{Prob}(100(1-d)\% \text{ of the } R^2 \text{ in } \mathcal{R} < R^2(s)) \ge g$$

where $R^2(s)$ is the largest $R^2$ value in the sample. For
example, with g = .99 and d = .1,

$$s \ge \frac{\ln(.01)}{\ln(.9)} \approx 43.7.$$

Thus, if 44 random permutations are obtained and their
corresponding $R^2$ values calculated, then

$$\mathrm{Prob}(90\% \text{ of the } R^2 \text{ in } \mathcal{R} < R^2(44)) \ge .99.$$

Comparing $R_0^2$ with $R^2(44)$ gives an indication of the
relative position of $R_0^2$ in $\mathcal{R}$. This could
provide the basis for a decision rule (reject $H_0$ if
$R_0^2 > R^2(44)$), but the significance level is only loosely
related to the parameters g and d. For this example, the
significance level would be approximately .1.
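The sample size of 44 can be reproduced with a short calculation. This sketch assumes the usual one-sided nonparametric tolerance bound, namely that the largest of s sampled values exceeds at least a fraction 1-d of the population with probability $1-(1-d)^s$; the function name is illustrative only.

```python
import math

def tolerance_sample_size(g, d):
    """Smallest s with 1 - (1 - d)**s >= g (assumed one-sided tolerance bound)."""
    s = math.ceil(math.log(1 - g) / math.log(1 - d))
    # Guard against floating-point rounding in the logarithms.
    while 1 - (1 - d) ** s < g:
        s += 1
    return s

s = tolerance_sample_size(g=0.99, d=0.1)
print(s)
```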

In order to obtain an exact test, this approach must
be modified. Since extremely large values of $R_0^2$ relative
to the set $\mathcal{R}$ provide evidence critical of the null
hypothesis, the following decision rule is appealing: reject
$H_0$ if $R_0^2 > R^2(s)$ where $R^2(s)$ is the largest $R^2$
value in a sample resulting from s random permutations. How
large must s be in order that the test have a significance
level of $\alpha$? If $H_0$ is true, $R_0^2$ is just as likely
to be any of the (s+1) observed $R^2$ values. That is,

$$\mathrm{Prob}(R_0^2 > R^2(s) \mid H_0 \text{ is true}) = 1/(s+1).$$

A level $\alpha$ test is obtained by taking s permutations of
the original y values where s is the smallest integer greater
than or equal to $(1-\alpha)/\alpha$. For example, for
$\alpha = .05$, s = .95/.05 = 19 permutations must be used.
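The sampled version of the test can be sketched as follows. This is a minimal illustration, not the author's program: the selection rule (`selected_r2`, best single predictor by squared correlation) is a hypothetical stand-in for the variable selection method in use, the data are made up, and the small tolerance in `permutations_needed` is only there to absorb floating-point error in $(1-\alpha)/\alpha$.

```python
import math
import random

def permutations_needed(alpha):
    """Smallest integer s with s >= (1 - alpha) / alpha."""
    return math.ceil((1 - alpha) / alpha - 1e-9)

def r_squared(x, y):
    """Squared sample correlation between x and y (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def selected_r2(columns, y):
    """Hypothetical selection rule: R^2 of the best single predictor."""
    return max(r_squared(col, y) for col in columns)

def sampled_permutation_test(columns, y, alpha=0.05, seed=None):
    """Reject H0 when the observed R^2 exceeds all s permuted R^2 values."""
    rng = random.Random(seed)
    s = permutations_needed(alpha)
    observed = selected_r2(columns, y)
    permuted = []
    for _ in range(s):
        shuffled = list(y)
        rng.shuffle(shuffled)          # random permutation of the y values
        permuted.append(selected_r2(columns, shuffled))
    return observed > max(permuted), observed, permuted

# Made-up data with a strong linear relation between y and x1.
x1 = [float(i) for i in range(10)]
x2 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.1, 0.0, -0.1]
y = [2.0 * v + w for v, w in zip(x1, noise)]
reject, observed, permuted = sampled_permutation_test([x1, x2], y, alpha=0.05, seed=0)
print(reject, round(observed, 4), round(max(permuted), 4))
```

Because the same selection rule is applied to the observed data and to every permuted copy, the level of the test does not depend on how aggressive the selection is.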

The determination of the power that the permutation
test will achieve against various alternatives is a difficult
problem and remains unsolved. A simulation is employed
to compare the power of the new test to that of the F test
in some situations where the latter is valid. In these
situations, the F test is optimal (1). But if the new test
has comparable power, it will provide an alternative to the
F test that is valid under a wider set of conditions, namely,
when variable selection techniques are employed. In
particular, random samples of size 30 are generated on the
vector (Y, X1, X2, ..., X5)' which has a 6-dimensional
multivariate normal distribution with mean vector 0 and
covariance matrix $\Sigma$. The data is analyzed by fitting the
full model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon,$$

and no variable selection technique is used. One hundred
such samples are generated and analyzed for each
distribution. For each test, the fraction of these samples which
resulted in $H_0$ being rejected gives an indication of that
test's power.

The covariance matrix $\Sigma$ is of the form

$$\Sigma = \begin{pmatrix} 1 & r' \\ r & \Omega \end{pmatrix}, \qquad r = (r_1, r_2, \ldots, r_5)'.$$

The values of r and $\Omega$ are chosen to give
prespecified values for the theoretical coefficient of
determination, $\rho^2$, and to allow for various covariance
structures such as nonorthogonal predictor variables. It
is known (1) that in these situations the power of the F
test depends on the covariance structure only through the
value of $\rho^2$. The purpose of this simulation is to make a
comparative study of the powers of the permutation test and
F test for the following special covariance structures.

Case 1. $\Omega = I$ and $r_i = r_1^*$ for i = 1, 2, ..., 5.

Case 2.

$$\Omega = \begin{pmatrix} 1 & .9 & .9 & \cdots & .9 \\ & 1 & .9 & \cdots & .9 \\ & & \ddots & & \vdots \\ & & & 1 & .9 \\ & & & & 1 \end{pmatrix}$$

and $r_i = r_2^*$ for i = 1, 2, ..., 5.

Case 3. $\Omega = I$ and $r_1 = r_2 = r_3^*$ and $r_i = 0$ for
i = 3, 4, 5. The values of $r_1^*$, $r_2^*$, and $r_3^*$ are
chosen to give the desired $\rho^2$ values.
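To make the choice of covariance values concrete, the sketch below computes the common value of the y-predictor covariances that yields a target $\rho^2$ in each case. It rests on assumptions that should be checked against the original: that $\Sigma$ is partitioned into a vector r of covariances between y and the predictors and a predictor covariance matrix $\Omega$ (with unit variances), so that $\rho^2 = r'\Omega^{-1}r$; that Case 2 uses a common predictor correlation of .9; and that Case 3 has two equal nonzero covariances and three zeros. All function names are illustrative.

```python
import math

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (m[i][n] - sum(m[i][c] * z[c] for c in range(i + 1, n))) / m[i][i]
    return z

def rho_squared(r, omega):
    """Population coefficient of determination r' Omega^{-1} r."""
    return sum(a * b for a, b in zip(r, solve(omega, r)))

def equicorrelated(k, c):
    """k x k matrix with ones on the diagonal and c elsewhere."""
    return [[1.0 if i == j else c for j in range(k)] for i in range(k)]

def case_structures(rho2, k=5, c=0.9):
    """Covariance vector r and predictor matrix Omega for the three assumed cases."""
    identity = equicorrelated(k, 0.0)
    r1 = [math.sqrt(rho2 / k)] * k                          # Omega = I, equal r_i
    r2 = [math.sqrt(rho2 * (1 + (k - 1) * c) / k)] * k      # equicorrelated Omega
    r3 = [math.sqrt(rho2 / 2)] * 2 + [0.0] * (k - 2)        # two nonzero r_i
    return {1: (r1, identity), 2: (r2, equicorrelated(k, c)), 3: (r3, identity)}

cases = case_structures(0.4)
for label, (r, omega) in cases.items():
    print(label, [round(v, 4) for v in r], round(rho_squared(r, omega), 6))
```

The Case 2 expression uses the fact that an equicorrelated matrix maps the vector of ones to $(1 + (k-1)c)$ times itself, so the quadratic form reduces to $k r^2 / (1 + (k-1)c)$.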

The results of this simulation appear in Table III.
The fraction rejected by both tests is given for various
$\rho^2$ values for each of the three cases. The theoretical
power of the F test also appears in the table. While these
theoretical values are listed, the actual fractions rejected
by the F test are included to give a better basis for
comparing the corresponding fraction rejected by the
permutation test based on the same data.


TABLE III

Fraction Rejected at .05 Significance Level

              Actual          Case 1           Case 2           Case 3
$\rho^2$     Power of      Perm.     F      Perm.     F      Perm.     F
              F test

  0            .050         .06     .06      .05     .07      .05     .04
  .1           .196         .15     .18      .20     .19      .21     .23
  .2           .418         .36     .44      .33     .39      .39     .42
  .3           .657         .55     .67      .54     .61      .60     .69
  .4           .847         .78     .85      .75     .83      .80     .87
  .5           .954         .90     .95      .91     .94      .89     .96
  .6           .992         .97     .98     1.00     .99     1.00    1.00
  .7           .999        1.00     .99     1.00    1.00     1.00    1.00
  .8           .999+       1.00    1.00     1.00    1.00     1.00    1.00
  .9           .999+       1.00    1.00     1.00    1.00     1.00    1.00

While the power of the permutation test cannot be expected
to match that of the F test, these figures offer evidence
that it performs surprisingly well. For fixed $\rho^2$ values,
the covariance structure appears to have no significant
effect on the power of the permutation test although that
conjecture remains an open question. To obtain more
numerical insight into that question, a larger scale
simulation was performed for $\rho^2 = .4$ by running 6
independent replications of size 100 for each covariance
structure.


TABLE IV

Fraction Rejected at .05 Significance Level
for $\rho^2 = .4$ - Additional Data

                        Replication
             1      2      3      4      5      6      Mean    Stand. dev.

Case 1      .71    .82    .77    .78    .76    .84     .78       .046
Case 2      .70    .79    .82    .71    .81    .74     .76       .051
Case 3      .81    .78    .80    .82    .70    .74     .775      .046

These results certainly strengthen the credibility of the
conjecture that the power of the permutation test depends
on the covariance structure only through the value of $\rho^2$
when no variable selection technique is used.

These results give some indication of the relative
performance of the permutation and F tests in situations where
the F test is valid. While the exact power function of the
F test is known (1), the mathematically intractable power
function of the permutation test necessitates this Monte Carlo
approach. In Chapter V, the application of this test to
problems of interest to management science will be investigated.


CHAPTER  IV 


POWER  TRANSFORMATIONS  OF 
BIVARIATE  SAMPLES 

In  the  analysis  of  data  it  is  often  necessary  to  use  a 
power  transformation  to  model  the  relationship  between  two 
variables.  An  appropriate  transformation  may  be  suggested 
by  economic  theories,  physical  properties,  or  other  such 
underlying  considerations  of  the  relation  being  studied. 

On  the  other  hand,  there  may  be  an  absence  of  any  such  firm 
theoretical  or  even  historical  indications.  Upon  inspection 
of  the  scatter  diagram,  it  may  be  obvious  that  a  linear 
relationship  between  the  two  variables  is  not  appropriate. 
For  these  situations,  Mosteller  and  Tukey  (19)  suggest 
considering  a  re-expression  of  one  or  both  of  the  variables 
so  that  the  resulting  relationship  is  more  nearly  linear. 
They suggest a re-expression of the form $(y+c)^p$ where c and
p are constants. According to Mosteller and Tukey, the
value of c is often zero and the most commonly used powers
are p = 1/2, p = -1, and p = 1/3 in descending frequency
of use.

To  aid  practitioners  in  the  selection  of  possible  p 
values,  they  offer  a  rule  of  thumb  called  the  "bulging 
rule."  Using  the  scatter  diagram  in  accordance  with  this 
rule indicates what values of p should be considered. For
the original y values, p is equal to 1. From this value of
p, the fundamental rule is to move on the "ladder" of
possible values of p in the direction in which the bulge of the
scatter plot points. Figure 8 illustrates how to use this
ladder of powers to aid in the re-expression of y for four
kinds of bulging. If the scatter plot resembles curve (a),
movements up the ladder and values of p larger than 1 should
be considered. The "bulging rule" may also be used to
indicate appropriate values for power transformations of x as
illustrated.


Figure 8: The Bulging Rule (quadrant labels: "y up, x up"; "x up, y down")


The  purpose  of  this  chapter  is  to  report  the  results  of 
a  study  of  the  influence  of  this  power  transformation  on  the 
sample  linear  correlation  between  the  two  variables  when  the 
values  of  c  and  p  are  empirically  determined  in  such  a  way 
as  to  maximize  the  sample  linear  correlation  R.  Maximizing 
R  by  empirically  determining  c  and  p  does  not  coincide  with 
Tukey's notion of "straightening out" the data. The focus
of this investigation is the possibility of artificially
high  values  for  R  when  the  true  linear  correlation  is  zero. 
The  motivation  for  this  study  was  the  question,  "Is  it 
possible  to  significantly  increase  the  sample  linear  corre¬ 
lation  between  two  variables  by  considering  power  trans¬ 
formations  of  the  dependent  variable  when  the  variables  are, 
in  fact,  independent?" 

A simulation is performed in an attempt to answer this
question. Samples of size 10 are generated from a bivariate
normal population with mean vector (10, 10)' and covariance
matrix I. A mean of 10 and a standard deviation of 1 are
used to ensure positive values for y since, according to
Mosteller and Tukey, y should represent an amount or count
if the re-expression $(y+c)^p$ is to be used. As a result of
this covariance structure, the theoretical linear
correlation is zero. Let R*(c,p) denote the sample correlation
between $(y+c)^p$ and x. Note that R*(c,1) = R for all c.
The values of c and p are determined via a two-dimensional
optimizing program in such a way as to maximize the value
of R*(c,p). Let R* denote this maximum value. A number of
such samples are considered with R and R* values calculated
for each sample. A scatter plot of these pairs is given
in Figure 9. For certain samples, the sample linear
correlation is substantially increased.
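One replicate of this simulation can be sketched as below. The original study used a two-dimensional optimizing program that is not described in detail; this sketch swaps in a coarse grid search over c and p, so the grid values, the seed, and the function names are all assumptions made for illustration. Shifts are chosen so that y + c stays positive, and p = 0 is excluded because it would make the transformed variable constant.

```python
import math
import random

def correlation(x, y):
    """Sample linear correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_power_transform(x, y, shifts, powers):
    """Grid search for (c, p) maximizing the correlation of x with (y + c)**p."""
    r = correlation(x, y)
    best = (r, 0.0, 1.0)                 # (R*, c, p); p = 1 reproduces R
    for shift in shifts:
        c = shift - min(y)               # smallest value of y + c equals shift > 0
        for p in powers:
            transformed = [(v + c) ** p for v in y]
            r_cp = correlation(x, transformed)
            if r_cp > best[0]:
                best = (r_cp, c, p)
    return r, best

rng = random.Random(1)
x = [rng.gauss(10, 1) for _ in range(10)]    # independent of y by construction
y = [rng.gauss(10, 1) for _ in range(10)]
shifts = [0.01, 0.1, 0.5, 1.0, 2.0, 5.0, min(y)]   # last entry gives c = 0
powers = [-2, -1, -0.5, 0.25, 0.5, 1.5, 2, 3, 4]
r, (r_star, c_opt, p_opt) = best_power_transform(x, y, shifts, powers)
print(f"R = {r:.3f}, R* = {r_star:.3f} at c = {c_opt:.2f}, p = {p_opt}")
```

Repeating this over many samples and plotting R* against R would give a picture comparable in spirit to the scatter plot described in the text, though a grid search will generally find a somewhat smaller maximum than a continuous optimizer.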
Figure 9: Scatter Plot of 30 Standardized Values of R* vs R

A simple linear regression of R* on R is performed in
order to investigate their relationship further. The
resulting estimate of the slope is 1.27 which is found to
be significantly greater than 1 at the .05 level. Figure 10
gives a plot of the absolute, standardized values of the
residuals which deserves some attention. Note that the

Examination of the simulation results reveals the fact

that the optimal transformations seem to cluster into two
categories. For each of these categories, the optimal value
of c tends to be the negative of the smallest y observation.
The optimal p values are either in the range 1.5-4.5 or the
search algorithm fails to identify an optimal p value. In
the latter case, R*(-y(1), p) increases as p is decreased
toward zero. Examination of scatter plots reveals the
reason for this phenomenon. It is related to the so-called
"lollipop effect" whose name will become clear shortly.
Recall that

$$R = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\left\{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\right\}^{1/2}\left\{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2\right\}^{1/2}} = \frac{s(x,y)}{s(x)\,s(y)}.$$

Obviously, if there is no linear relationship between x and
y, s(x,y) would be expected to be close to zero. Since only
transformations of y are being considered, the x values and,
thus, s(x) are fixed. Therefore, a transformation of y
yields a variable y* that will be more linearly correlated
with x than y is if the resulting ratio s(x,y*)/s(y*) is
larger than s(x,y)/s(y). When can such a transformation
be expected to exist? The answer lies in the analysis
of the scatter plot of the (x,y) pairs. If an extreme
value of y is associated with an extreme value of x, then it
is possible to use a transformation of the form $y^* = (y+c)^p$
to substantially increase the sample linear correlation as
shall be seen with the help of an example.

Table V gives a set of x and y pairs resulting from
a typical simulation run. Note that the smallest value of
y = 8.79 is paired with x = 8.89, the third smallest x.
Also, the deviation of this x from the mean, $\bar{x}$, is -.96,
the third largest in absolute value. Some of the initial
statistics are R = .11, s(x,y) = .109, s(y) = 1.25 and
$\bar{y}$ = 10.51. A transformation with parameter values as
described for the second type of optimality is found using the
search algorithm. That is, the smallest y value (8.79) is
subtracted from each of the y observations, and the resulting
differences raised to the power p = 0. These values are
given in Table V. Note that s(y*) = .317 is much smaller
than s(y) = 1.25. Thus, the influence of the transformation
on the value of s(x,y*) will determine if the sample linear
correlation is increased. Note that s(x,y*) is a weighted
sum of deviations of x values about their mean. These
weights are the differences $(y_i^* - \bar{y}^*)$. As a result
of the transformation, these deviations are small except for
$(y_1^* - \bar{y}^*) = -.90$. The transformation reduces the
variability of the dependent variable and associates with a large
x deviation the largest y* deviation, which yields
s(x,y*) = .106. As a result of this transformation, R is
increased from .11 to R* = .41.

TABLE V. Values from Simulation Run

      y          x         y*

     8.79       8.89       0
     8.93       9.80       1
    11.06      10.77       1
    12.21      10.33       1
    10.68      10.51       1
    10.96       9.70       1
    10.15       8.58       1
    10.44      10.67       1
    12.46       8.86       1
     9.37      10.36       1

$\bar{y}$ = 10.51      $\bar{x}$ = 9.85      $\bar{y}^*$ = .90
s(y) = 1.25       s(x) = .82       s(y*) = .317
s(x,y) = .109     s(x,y*) = .106
R = .11           R* = .41
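The summary statistics in Table V can be recomputed from the ten (y, x) pairs. The sketch below is a check of the arithmetic only; the helper names are illustrative, and y* is taken to be 0 for the smallest y and 1 otherwise, as in the table.

```python
import math

y = [8.79, 8.93, 11.06, 12.21, 10.68, 10.96, 10.15, 10.44, 12.46, 9.37]
x = [8.89, 9.80, 10.77, 10.33, 10.51, 9.70, 8.58, 10.67, 8.86, 10.36]

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance with divisor n - 1."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def sd(v):
    """Sample standard deviation with divisor n - 1."""
    return math.sqrt(cov(v, v))

# Indicator transformation: 0 for the smallest y, 1 for every other observation.
y_star = [0.0 if value == min(y) else 1.0 for value in y]

R = cov(x, y) / (sd(x) * sd(y))
R_star = cov(x, y_star) / (sd(x) * sd(y_star))
print(f"s(x,y) = {cov(x, y):.3f}, s(x,y*) = {cov(x, y_star):.3f}")
print(f"s(y) = {sd(y):.3f}, s(y*) = {sd(y_star):.3f}, s(x) = {sd(x):.3f}")
print(f"R = {R:.2f}, R* = {R_star:.2f}")
```

The covariance barely changes, while the standard deviation of the transformed variable drops to about a quarter of its original value, which is what drives the increase in correlation.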

This example illustrates the "lollipop effect." The
name is a result of the appearance of the scatter diagram
of the (x,y*) pairs (see Figure 11). The optimal
transformation isolates one point while grouping the remaining
points, giving the data set a lollipop appearance.

Figure 11: Scatter Plot of 10 Standardized Values of Y* vs. X

It has been shown that it is possible to inflate the
value of R using an empirically-determined power
transformation when the two variables are actually independent.
However, this "inflation" should not go undetected. A
residual plot, such as Figure 12 for the above example, indicates
this lollipop effect. Upon inspection of such residuals,
the experienced data analyst should not fail to realize the
reason for this anomaly and reject the method summarily.

For some samples, it has been shown that the "lollipop
effect" creates a substantially increased value of the
sample linear correlation. Even samples from correlated,
bivariate populations may have the potential for this
phenomenon. The following theorem gives insight into this
potential for any bivariate sample.

Theorem: For a bivariate sample of size n, $(y_1,x_1)'$,
$(y_2,x_2)'$, ..., $(y_n,x_n)'$, let $y_j$ be the smallest y
value in the sample and let y be transformed to y* so that
$y_i^* = 1$ for i = 1, 2, ..., n, i ≠ j, and $y_j^* = 0$. Then
the value of the simple linear correlation, R*, between y*
and x is given by

$$R^* = \frac{\sqrt{n}\,(\bar{x} - x_j)}{(n-1)\,s(x)}$$

and the estimates of the parameters in the regression of y*
on x, given by

$$y^* = b_0 + b_1 x + e,$$

are

$$b_1 = \frac{\bar{x} - x_j}{(n-1)(s(x))^2} \quad \text{and} \quad b_0 = \frac{n-1}{n} - b_1\bar{x}.$$

Proof:


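Obviously, $\bar{y}^* = (n-1)/n$. Thus

$$(s(y^*))^2 = \frac{1}{n-1}\left\{(n-1)\left(1 - \frac{n-1}{n}\right)^2 + \left(0 - \frac{n-1}{n}\right)^2\right\} = \frac{1}{n}.$$

Therefore, since $\sum_{i \ne j}(x_i - \bar{x}) = -(x_j - \bar{x})$,

$$R^* = \frac{s(x,y^*)}{s(x)\,s(y^*)} = \frac{\frac{1}{n-1}\left\{(x_j - \bar{x})\left(0 - \frac{n-1}{n}\right) + \frac{1}{n}\sum_{i \ne j}(x_i - \bar{x})\right\}}{s(x)\sqrt{1/n}} = \frac{\sqrt{n}\,(\bar{x} - x_j)}{(n-1)\,s(x)}.$$

Furthermore,

$$b_1 = R^*\,\frac{s(y^*)}{s(x)} = \frac{\bar{x} - x_j}{(n-1)(s(x))^2}$$

and

$$b_0 = \bar{y}^* - b_1\bar{x} = (n-1)/n - b_1\bar{x}. \qquad \text{Q.E.D.}$$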
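The closed-form expressions in the theorem can be checked numerically against a direct computation on any sample. The sketch below does this for a simulated sample; the function names, the seed, and the simulated data are illustrative assumptions, not part of the original study.

```python
import math
import random

def sample_sd(v):
    """Sample standard deviation with divisor n - 1."""
    m = sum(v) / len(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def lollipop_closed_form(x, y):
    """R*, b0 and b1 from the theorem, using only x and the position of min(y)."""
    n = len(x)
    j = y.index(min(y))
    x_bar = sum(x) / n
    s_x = sample_sd(x)
    r_star = math.sqrt(n) * (x_bar - x[j]) / ((n - 1) * s_x)
    b1 = (x_bar - x[j]) / ((n - 1) * s_x ** 2)
    b0 = (n - 1) / n - b1 * x_bar
    return r_star, b0, b1

def lollipop_direct(x, y):
    """Same quantities computed directly from the indicator variable y*."""
    n = len(x)
    j = y.index(min(y))
    y_star = [0.0 if i == j else 1.0 for i in range(n)]
    x_bar, ys_bar = sum(x) / n, sum(y_star) / n
    sxy = sum((a - x_bar) * (b - ys_bar) for a, b in zip(x, y_star))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - ys_bar) ** 2 for b in y_star)
    b1 = sxy / sxx
    return sxy / math.sqrt(sxx * syy), ys_bar - b1 * x_bar, b1

rng = random.Random(7)
x = [rng.gauss(10, 1) for _ in range(10)]
y = [rng.gauss(10, 1) for _ in range(10)]
closed = lollipop_closed_form(x, y)
direct = lollipop_direct(x, y)
print([round(v, 6) for v in closed])
print([round(v, 6) for v in direct])
```

Note that R* depends on the y values only through which observation is smallest; everything else is determined by the x values.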

If the sample deviation, s(x), is "small" relative to
the deviation from the mean of the x value corresponding to
the smallest y observation, then the resulting R* may be
substantially inflated. Two examples will help illustrate
this point.

In the first example, a sample of size 10 is taken from
a bivariate normal population with mean vector (10, 10)' and
covariance matrix

$$\begin{pmatrix} 1 & .5 \\ .5 & 1 \end{pmatrix}.$$

A scatter plot of these points appears in Figure 13.
The theoretical correlation is .5, and the sample
correlation is .66. Note that the smallest value of y, which is
8.54, is associated with the smallest x value, which is 7.04.
Thus, an inflated value of R* is expected. Using the
above theorem, we have

Figure 13: Scatter Plot of 10 Standardized Values of Y vs. X

Figure 14 is the scatter plot of 10 observations from
a bivariate normal population with mean vector (10, 10)'
and covariance matrix

$$\begin{pmatrix} 1 & .8 \\ .8 & 1 \end{pmatrix}.$$

The theoretical correlation is .8, and the sample
correlation is .82. Note that the smallest y, 9.36, is paired
with an x, 9.67, which is near its sample mean of 10.18.
Thus, the above transformation might result in a small value
of R* as seen by

$$R^* = \frac{\sqrt{10}\,(10.18 - 9.67)}{9\,s(x)} = .34.$$

Figure 14: Scatter Plot of 10 Standardized Values of Y vs. X


The  use  of  empirically-determined  power  transformations 
may  lead  to  inflated  values  of  the  sample  linear  corre¬ 
lation.  The  potential  for  this  phenomenon  has  been  demon¬ 
strated  for  independent  and  correlated  bivariate  samples. 
However, the fact that the value is inflated should not go
undetected.  An  inspection  of  the  resulting  residuals  should 
aid  the  data  analyst  in  spotting  this  "lollipop  effect". 


CHAPTER  V 


APPLICATIONS  OF  PERMUTATION  TEST 

In  this  chapter  the  applicability  of  the  statistical 
procedures  developed  previously  to  problems  of  parametric 
cost  estimation  is  illustrated.  Parametric  cost  estimation 
is  a  management  tool  used  to  aid  in  the  prediction  of  the 
cost  of  a  proposed  system.  It  involves  predicting  the  cost 
(dependent  variable)  of  a  system  by  means  of  explanatory 
(independent)  variables  such  as  system  characteristics  or 
performance  requirements.  This  procedure  is  based  on  the 
premise  that  the  cost  of  a  system  is  related  in  a  quanti¬ 
fiable  way  to  the  system's  physical  and  performance  charac¬ 
teristics  (14).  The  expression  of  this  quantifiable 
relationship  is  in  the  form  of  an  estimating  equation  de¬ 
rived  through  statistical  regression  analysis  of  historical 
cost  data  on  systems  which  are,  more-or-less,  analogous  to 
the  proposed  system.  Since  parametric  cost  estimates  can  be 
developed  during  the  concept  formulation  stage  of  the  acqui¬ 
sition  process  before  engineering  plans  are  finalized,  these 
estimates  can  be  used  by  management  to  (14) : 

1.  Identify  possible  cost/performance  tradeoffs 
in  the  design  effort. 

2.  Provide  a  basis  for  cost/effectiveness  review 
of  performance  specifications. 

3.  Provide information useful in the ranking of
competing alternatives.

4.  Suggest a need for investigating new
alternatives.

In  particular,  examples  of  parametric  cost  estimation 
for  Navy  weapon  systems  will  be  considered.  Cost  overruns 
have  been  prevalent  in  the  acquisition  of  new  weapon  systems 
making  cost  estimation  a  very  important  problem  for  all 
components  of  the  Department  of  Defense.  These  overruns 
result  in  very  difficult  budget  decisions  and  a  decrease 
in  the  Congress's  confidence  in  the  managerial  ability  of 
military leaders. For fiscal year 1971, the Navy experienced
a cost growth of $19 billion on 24 weapon systems;
15% of this cost growth was attributed to poor initial cost
estimates  (14).  Historically,  the  Navy  has  used  industrial 
engineering  techniques  to  develop  estimates  of  the  cost  of 
a  proposed  system.  These  techniques  required  detailed 
studies  of  the  operations  and  materials  required  to  produce 
the  new  system.  Although  a  great  deal  of  time  and  effort 
is  required  to  produce  these  estimates,  there  is  considerable 
uncertainty  remaining  as  evidenced  by  the  overruns  mentioned 
above. In addition, slight design changes can vitiate the
estimate and necessitate a complete restudy. To help
improve such performance, the Department of Defense has
issued directives to all branches of the service to employ
independent parametric cost estimation. Publications such
as (14) have appeared which give step-by-step methodology
for the development of a parametric cost estimate.

Regression problems faced by costing and pricing analysts
in  these  situations  are  inherently  difficult  for  two  funda¬ 
mental  reasons  (20): 

1.  The  number  of  observations  is  usually  small 
compared  with  the  number  of  system  character¬ 
istics  which  are  candidate  components  of  the 
regression  equation. 

2.  The available data is not produced by employing
an efficient experimental design, but by what
Box (4) calls "unplanned happenings."

Under these circumstances, it has been shown that the
use of variable selection techniques may result in
regression equations which yield inflated $R^2$ values whose
statistical significance cannot be tested using the F test.

One  approach  for  the  development  of  a  parametric  cost 
estimate  involves  breaking  the  system  up  into  component 
subsystems  and  using  a  separate  model  to  estimate  the  cost 
of  each  component.  This  process,  called  disaggregation  (14), 
will  generally  result  in  better  subsystem  cost  estimates, 
and  if  these  estimates  are  independent,  a  combined  estimate 
of  system  cost  can  be  obtained  in  the  obvious  way.  For 
example,  a  cost  estimate  may  be  desired  for  the  construction 
of a new submarine under consideration. A possible
component subsystem would be its sonar system. A cost estimate
of this subsystem might be based on a model with such
candidate predictors as weight and volume of the internal
electronics, number of hydrophone amplifiers, power output,
sensitivity, the year that the sonar system became fleet
operational, etc. Total system cost is then estimated by
reaggregating subsystem estimates. The determination of a
confidence  interval  for  the  total  system  cost  is  a  difficult 
problem  and  remains  unsolved.  The  difficulty  is  a  result 
of  the  lack  of  understanding  of  the  effect  of  interactions 
among  the  subsystems  on  the  factors  influencing  total  cost. 
The  development  of  cost-estimating  models  for  missile  sub¬ 
systems  will  be  explored  via  data  obtained  from  the  Naval 
Weapons  Center  at  China  Lake,  California.  The  data  has 
been  sanitized  for  security  reasons  without  destroying  the 
relationships  between  variables. 

Table  VI  presents  historical  data  on  the  cost  and 
relevant  performance  characteristics  of  a  certain  type  of 
system  which  we  shall  designate  Subsystem  A.  Presumably, 
values of X1, X2, ..., X7 of a proposed system will be
substituted into the prediction equation developed for the
data  in  Table  VI  in  order  to  produce  a  cost  estimate  of 
the  proposed  system.  As  is  typical  for  parametric  costing 
problems,  the  number  of  observations  available  is  not  large 
compared  to  the  number  of  candidate  predictors.  Here, 
there  are  8  observations  on  the  cost  and  7  system  charac¬ 
teristics.  With  the  information  provided  in  this  data, 
we  want  to  determine  the  performance  characteristics  which 
provide  a  model  that  will  best  estimate  the  cost  of  the 
proposed  subsystem. 


TABLE VI. Cost Data for Subsystem A

  X1      X2      X3      X4      X5      X6      X7      Y

 1.09    2.06    0.41    2.48    1.08    0.00    0.00    0.00
 1.09    2.06    0.41    2.48    2.17    0.03    0.00    0.02
 1.09    2.06    0.41    2.48    2.17    0.03    0.00    0.04
 0.00    0.21    0.00    0.00    0.00    2.12    0.56    0.78
 0.00    0.62    0.12    0.19    0.00    2.22    0.64    0.65
 2.38    0.00    1.39    1.24    0.54    1.89    2.08    2.47
 2.38    0.00    1.39    1.24    2.17    1.84    2.08    1.96
 2.38    0.00    2.99    1.24    2.17    0.91    2.17    1.94

A stepwise regression algorithm is applied to the data
yielding:

1.  the best single-variable model

$$\hat{y} = .06 + .98\,x_7, \qquad (1)$$

2.  the best 2-variable model

$$\hat{y} = .20 - .12\,x_5 + .99\,x_7. \qquad (2)$$

The $R^2$ associated with model (1) is .964 and that with
model (2) is .978. Thus, the data analyst might consider
using the single-variable model to obtain a cost estimate
since, as mentioned in a previous chapter, the variance of
prediction cannot be reduced by adding variables to the
regression equation. The standard F test applied to this
model yields a highly significant

$$F = \frac{R^2\,(n-p-1)}{(1-R^2)\,p} = 159.57.$$

Having shown that the use of variable selection tends
to inflate the value of the F statistic, we consider the
permutation test. Since a significance level of .05 is
desired, 19 random permutations must be used. For each
permutation, the stepwise algorithm is used to determine the
best single-variable model. The $R^2$ value associated with
this model is saved. Recall that the rejection rule is to
reject $H_0: \rho^2 = 0$ if $R_0^2$, which is .964, exceeds
$R^2(19)$ where $R^2(19)$ is the largest $R^2$ observed in the
sample of permutations. Figure 15 gives a stem and leaf
display of the 20 $R^2$ values.

Note that the largest sample value of $R^2$ is .897.
Thus, $R_0^2 > R^2(19)$, $H_0$ is rejected, and it is concluded
that the single-variable model is significant at the .05 level.
A cost estimate for the proposed system is obtained by
evaluating this single-variable model at the value X7 of
the proposed system.


 1.0 |
 0.9 | 64  (<- R²)
 0.8 | 97
 0.7 |
 0.6 | 13, 92
 0.5 | 13, 21
 0.4 | 12, 57, 64
 0.3 | 09, 43
 0.2 | 22, 26, 27, 35
 0.1 | 42, 55, 66
 0.0 | 25, 68

Figure 15: Stem and Leaf of R² Values for Subsystem A
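The permutation procedure just described for Subsystem A can be
sketched in a few lines of Python. This is an illustration under
stated assumptions, not the author's code. The "best single-variable
model" is taken to be the predictor with the largest squared
correlation with cost, which is what the first step of a forward
stepwise search selects. The data are Table VI read row by row, with
"0.S6" taken as 0.56. The function names and the seed are
hypothetical. Because the permutations are random, the permuted R²
values will differ from those in Figure 15.

```python
import random

# Table VI (Subsystem A): each row is x1..x7, then cost y.
# Assumption: the garbled entry "0.S6" is taken to be 0.56.
TABLE_VI = [
    [1.09, 2.06, 0.41, 2.48, 1.08, 0.00, 0.00, 0.00],
    [1.09, 2.06, 0.41, 2.48, 2.17, 0.03, 0.00, 0.02],
    [1.09, 2.06, 0.41, 2.48, 2.17, 0.03, 0.00, 0.04],
    [0.00, 0.21, 0.00, 0.00, 0.00, 2.12, 0.56, 0.78],
    [0.00, 0.62, 0.12, 0.19, 0.00, 2.22, 0.64, 0.65],
    [2.38, 0.00, 1.39, 1.24, 0.54, 1.89, 2.08, 2.47],
    [2.38, 0.00, 1.39, 1.24, 2.17, 1.84, 2.08, 1.96],
    [2.38, 0.00, 2.99, 1.24, 2.17, 0.91, 2.17, 1.94],
]


def r_squared(x, y):
    """R^2 of the one-predictor least-squares fit of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((u - xbar) ** 2 for u in x)
    syy = sum((v - ybar) ** 2 for v in y)
    sxy = sum((u - xbar) * (v - ybar) for u, v in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return sxy ** 2 / (sxx * syy)


def best_single_r2(predictors, y):
    """Selection step: largest R^2 over all one-variable models."""
    return max(r_squared(col, y) for col in predictors)


def permutation_test(predictors, y, n_perm=19, seed=1):
    """Compare the observed best R^2 with the best R^2 from permuted costs."""
    rng = random.Random(seed)
    observed = best_single_r2(predictors, y)
    permuted = []
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)  # destroys any real link between x and y
        # The selection is repeated on every permuted sample.
        permuted.append(best_single_r2(predictors, shuffled))
    reject = observed > max(permuted)  # the R^2 > R^2(19) rule
    return observed, permuted, reject


predictors = [[row[j] for row in TABLE_VI] for j in range(7)]
cost = [row[7] for row in TABLE_VI]

observed, permuted, reject = permutation_test(predictors, cost)
print(f"observed R^2 = {observed:.3f}")
print(f"largest permuted R^2 = {max(permuted):.3f}")
print("reject H0 at .05" if reject else "cannot reject H0 at .05")
```

The point of the design is that the variable search is rerun inside
the loop. The reference distribution therefore reflects the
selection effect that inflates the classical F statistic.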

Historical information for systems similar to a proposed
system designated as Subsystem B appears in Table VII.
Six observations are supplied on the cost of the system and
7 of its operating characteristics. Again a stepwise
algorithm is applied to the data, and it yields the
following models:

1.    y = -.23 + .88 x1,                                 (3)

2.    y = -1.19 + .96 x1 + .46 x7.                       (4)
The R² values for models (3) and (4) are .768 and .978,
respectively. The two-variable model appears to do the
better job, but both will be analyzed.

TABLE VII. Cost Data for Subsystem B

  X1     X2     X3     X4     X5     X6     X7     Y

 0.00   0.00   0.72   0.55   1.95   2.94   2.04   0.00
 0.79   1.57   0.72   0.00   0.24   1.56   2.04   0.42
 0.79   1.57   0.72   0.00   0.24   1.56   2.04   0.37
 2.91   1.57   2.40   0.00   2.15   1.10   2.26   2.74
 0.79   2.32   2.40   0.34   0.00   0.00   2.99   0.85
 1.58   0.17   0.00   2.56   0.04   0.64   0.00   0.28

The F test yields: for (3), F = 13.33 > F(.05,1,4) = 7.71,
and for (4), F = 66.93 > F(.05,2,3) = 9.55. Therefore,
both models appear to be significant at the .05 level. Once
again the permutation test is applied using 19 random
permutations of the cost values. Figures 16 and 17 present
stem and leaf histograms for the R² values associated with
the best one- and two-variable models, respectively.

Figure 16: Stem and Leaf of R² Values for One-variable Models

Figure 17: Stem and Leaf of R² Values for Two-variable Models

First of all, consider the results for the one-variable
model. The largest value of R² in the sample is .945, which
prohibits the rejection of H0. Also, note that the value of
R² associated with the unpermuted data, .768, is surpassed
by a number of other sample R² values. This gives further
evidence that the one-variable model given by (3) is not
statistically significant. For the two-variable model, the
largest sample value of R² is .989, which exceeds the
observed .978. Once again H0 cannot be rejected. The
two-variable model appears not to be statistically
significant. Based upon these results, the analyst would
conclude that neither of the proposed models provides a
statistically significant fit for the cost of this
subsystem.
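The Subsystem B analysis can be sketched in the same way for both
model sizes. The code below is a hedged illustration, not the
procedure actually run for Figures 16 and 17. It replaces the
stepwise search with an exhaustive search over all one- and
two-variable subsets, which may select a different pair than
stepwise would. It assumes Table VII is read row by row with columns
X1 through X7 and then Y, and it uses the standard closed form for
R² with two predictors in terms of pairwise correlations. All
function names and the seed are hypothetical.

```python
import random
from itertools import combinations

# Table VII (Subsystem B): each row is x1..x7, then cost y.
TABLE_VII = [
    [0.00, 0.00, 0.72, 0.55, 1.95, 2.94, 2.04, 0.00],
    [0.79, 1.57, 0.72, 0.00, 0.24, 1.56, 2.04, 0.42],
    [0.79, 1.57, 0.72, 0.00, 0.24, 1.56, 2.04, 0.37],
    [2.91, 1.57, 2.40, 0.00, 2.15, 1.10, 2.26, 2.74],
    [0.79, 2.32, 2.40, 0.34, 0.00, 0.00, 2.99, 0.85],
    [1.58, 0.17, 0.00, 2.56, 0.04, 0.64, 0.00, 0.28],
]


def corr(u, v):
    """Pearson correlation; assumes neither column is constant."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    return suv / (suu * svv) ** 0.5


def r2_one(x, y):
    """R^2 of the regression of y on a single predictor."""
    return corr(x, y) ** 2


def r2_two(xa, xb, y):
    """R^2 of the regression of y on two predictors, via correlations."""
    ra, rb, rab = corr(xa, y), corr(xb, y), corr(xa, xb)
    denom = 1.0 - rab ** 2
    if denom < 1e-12:
        return ra ** 2  # collinear pair: no gain over one variable
    return (ra ** 2 + rb ** 2 - 2.0 * ra * rb * rab) / denom


def best_r2(cols, y, k):
    """Largest R^2 over all subsets of size k (k is 1 or 2)."""
    if k == 1:
        return max(r2_one(c, y) for c in cols)
    return max(r2_two(a, b, y) for a, b in combinations(cols, 2))


def permutation_test(cols, y, k, n_perm=19, seed=1):
    """Observed best R^2, permuted best R^2 values, and the decision."""
    rng = random.Random(seed)
    observed = best_r2(cols, y, k)
    permuted = []
    for _ in range(n_perm):
        shuffled = list(y)
        rng.shuffle(shuffled)
        permuted.append(best_r2(cols, shuffled, k))  # search again each time
    return observed, permuted, observed > max(permuted)


cols = [[row[j] for row in TABLE_VII] for j in range(7)]
cost = [row[7] for row in TABLE_VII]

# Columns 0 and 6 (x1 and x7) are the predictors assumed for models (3), (4).
r2_model3 = r2_one(cols[0], cost)
r2_model4 = r2_two(cols[0], cols[6], cost)
print(f"model (3) R^2 = {r2_model3:.3f}, model (4) R^2 = {r2_model4:.3f}")

results = {k: permutation_test(cols, cost, k) for k in (1, 2)}
for k, (obs, perm, rej) in results.items():
    verdict = "reject H0" if rej else "cannot reject H0"
    print(f"k = {k}: observed {obs:.3f}, largest permuted {max(perm):.3f}, {verdict}")
```

With only six observations and seven candidate predictors, many
permuted samples can be expected to produce a large best-subset R².
That is the chance-correlation effect the test is meant to expose.
The specific permuted values depend on the seed.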

In parametric cost estimation generally, a researcher
should not blindly trust the regression equation resulting
from his analysis. To measure the "goodness of fit", the
analyst can use such statistics as R² and F. However, there
are few hard and fast rules for assessing the usefulness of
such a model. This is especially true of models that result
from the application of a variable selection technique. The
R² and F statistics in these situations may not give a
meaningful indication of the model's applicability. More
than just a model's statistics are needed if an analyst is
to be satisfied that a model will accurately predict the
system's cost. By obtaining a good knowledge of the kind of
equipment with which he is dealing (its characteristics, the
state of its technology, and the available data), the
analyst will be able to develop a particular model structure
based on sound technological reasoning.

In situations where a variable selection technique is
applied to the data to obtain a "best" prediction equation,
the permutation test can aid the researcher in the
evaluation of this model. It allows the analyst to perform a
valid test of hypothesis of the statistical significance
of the particular model structure. In situations such as
that demonstrated for Subsystem B, where the test indicates
that the model is not statistically significant, the
available data are such that chance correlation is likely.
Possible recourses include:

1. Recheck the definitions used for the parametric
   and cost data.

2. Collect more observations to improve the data
   base.

3. Validate any questionable data points that
   lie outside the expected range of values.

In any event, the permutation test is another tool that
the researcher can use in evaluating the suitability of the
cost estimating equation.


CHAPTER  VI 


CONCLUSIONS 

In empirical model building, unlike confirmatory
statistical inference, the situation of working with a given
model which possesses certain appealing properties is not
assured. The statistical properties of the process under
investigation are generally complex and not well understood.
The researcher is forced to draw ad hoc inferences from what
is often nonexperimental data. In these situations, much of
the traditional theory is not valid. In this dissertation
it has been illustrated that pedestrian use of such
techniques as variable selection and transformations may
result in models whose R² values are misleadingly large.

Leamer (12) considers the purpose of the data-dependent
process of selecting a statistical model to be
"data-mining": using empirical analysis to bring to the
surface the nuggets of truth that may be buried in a data
set. The researcher has available a plethora of statistical
computer packages to help bring these nuggets to the
surface. To help distinguish precious stones from fool's
gold, the researcher must depend on his judgment and
experience and the extant statistical theory. In situations
where a variable selection technique has been employed,
there is a paucity of viable statistical methods to aid in
the assessment of the resulting model. Important methods
such as residual analysis and cross-validation have not been
considered in this study. Such techniques can offer the
researcher valuable information about the model
specification. However, these specification checks become
less effective when the number of sample observations is
small.

The concentration here has been on the investigation
of the statistical properties of R² in such situations.
These efforts have resulted in some theory, some
informative simulations, and some interesting applications.
A statistical test for hypotheses of interest for models
resulting from selection techniques has been developed.
This result, the permutation test presented in Chapter III,
fills the void of valid statistical tests created as a
result of the use of data analytic procedures. In situations
where the classical F test cannot be used, this test gives
the researcher a method for testing the significance of his
model. This permutation test is actually an application of
an old technique (Fisher's randomization test) in an area
of practical importance where theoretical results have
been elusive.